Today we will…
tidyr
dplyr
mutate() vs summarise()
Tidy data…
R4DS
Allison Horst
We may need to transform our data to turn it into the version of tidy that is best for a task at hand.
Allison Horst
Let’s say we want to look at mean cereal nutrients based on shelf.
| name | manuf | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100% Bran | N | cold | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1 | 0.33 | 68.40297 |
| 100% Natural Bran | Q | cold | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1 | 1.00 | 33.98368 |
| All-Bran | K | cold | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1 | 0.33 | 59.42551 |
| All-Bran with Extra Fiber | K | cold | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1 | 0.50 | 93.70491 |
| Almond Delight | R | cold | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1 | 0.75 | 34.38484 |
| Apple Cinnamon Cheerios | G | cold | 110 | 2 | 2 | 180 | 1.5 | 10.5 | 10 | 70 | 25 | 1 | 1 | 0.75 | 29.50954 |
| shelf | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 102.5000 | 2.650000 | 0.60 | 176.2500 | 1.6850000 | 15.80000 | 4.800000 | 75.50000 | 20.00000 |
| 2 | 109.5238 | 1.904762 | 1.00 | 145.7143 | 0.9047619 | 13.61905 | 9.619048 | 57.80952 | 23.80952 |
| 3 | 107.7778 | 2.861111 | 1.25 | 158.6111 | 3.1388889 | 14.50000 | 6.527778 | 129.83333 | 35.41667 |
my_colors <- c("calories_col" = "steelblue", "sugars_col" = "orange3")
cereal_wide |>
  ggplot() +
  geom_point(aes(x = shelf, y = calories, color = "calories_col")) +
  geom_line(aes(x = shelf, y = calories, color = "calories_col")) +
  geom_point(aes(x = shelf, y = sugars, color = "sugars_col")) +
  geom_line(aes(x = shelf, y = sugars, color = "sugars_col")) +
  scale_color_manual(values = my_colors, labels = names(my_colors)) +
  labs(x = "Shelf", y = "", subtitle = "Mean Amount", color = "Nutrient")

| shelf | Nutrient | mean_amount |
|---|---|---|
| 1 | calories | 102.5000000 |
| 1 | carbo | 15.8000000 |
| 1 | fat | 0.6000000 |
| 1 | fiber | 1.6850000 |
| 1 | potass | 75.5000000 |
| 1 | protein | 2.6500000 |
| 1 | sodium | 176.2500000 |
| 1 | sugars | 4.8000000 |
| 1 | vitamins | 20.0000000 |
| 2 | calories | 109.5238095 |
| 2 | carbo | 13.6190476 |
| 2 | fat | 1.0000000 |
| 2 | fiber | 0.9047619 |
| 2 | potass | 57.8095238 |
| 2 | protein | 1.9047619 |
| 2 | sodium | 145.7142857 |
| 2 | sugars | 9.6190476 |
| 2 | vitamins | 23.8095238 |
| 3 | calories | 107.7777778 |
| 3 | carbo | 14.5000000 |
| 3 | fat | 1.2500000 |
| 3 | fiber | 3.1388889 |
| 3 | potass | 129.8333333 |
| 3 | protein | 2.8611111 |
| 3 | sodium | 158.6111111 |
| 3 | sugars | 6.5277778 |
| 3 | vitamins | 35.4166667 |
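As a rough sketch of how the wide table of shelf means could be reshaped into the long table above: the tibble `cereal_small` below is a hypothetical stand-in built from the six cereal rows shown earlier, restricted to two nutrients, so its means will not match the full-data values in the tables.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for the cereal data: the six rows shown above,
# restricted to two nutrient columns.
cereal_small <- tribble(
  ~name,                       ~shelf, ~calories, ~sugars,
  "100% Bran",                      3,        70,       6,
  "100% Natural Bran",              3,       120,       8,
  "All-Bran",                       3,        70,       5,
  "All-Bran with Extra Fiber",      3,        50,       0,
  "Almond Delight",                 3,       110,       8,
  "Apple Cinnamon Cheerios",        1,       110,      10
)

# Wide: one row per shelf, one column per nutrient mean.
cereal_wide_small <- cereal_small |>
  group_by(shelf) |>
  summarise(across(.cols = calories:sugars, .fns = mean))

# Long: one row per shelf-nutrient pair.
cereal_long_small <- cereal_wide_small |>
  pivot_longer(cols = calories:sugars,
               names_to = "Nutrient",
               values_to = "mean_amount")
```

With the long version, a single `geom_line(aes(x = shelf, y = mean_amount, color = Nutrient))` can stand in for the repeated geoms that the wide version needs.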
Consider daily rainfall observed in SLO in January 2023.
How would you manually convert this to long format?
Keep the column Week.
Create a new column Day_of_Week.
Create a new column Rainfall (to hold daily rainfall values).
Repeat Week 1-5 and copy Monday values over.
Repeat Week 1-5 and copy Tuesday values over.
arrange() by Week to get the rainfall values in chronological order.
We can use pivot_longer() to turn a wide dataset into a long(er) dataset.
pivot_longer()
Take a wide dataset and turn it into a long dataset.
cols – specify the columns that should be pivoted.
names_to – the name of the new column containing the old column names.
values_to – the name of the new column containing the old column values.
pivot_longer()

| Week | Day_of_Week | Daily_Rainfall |
|---|---|---|
| 1 | Sunday | 0.00 |
| 1 | Monday | 0.12 |
| 1 | Tuesday | 0.00 |
| 1 | Wednesday | 1.58 |
| 1 | Thursday | 0.91 |
| 1 | Friday | 0.00 |
| 1 | Saturday | 0.05 |
| 2 | Sunday | 0.27 |
| 2 | Monday | 4.26 |
| 2 | Tuesday | 0.43 |
| 2 | Wednesday | 0.00 |
| 2 | Thursday | 0.00 |
| 2 | Friday | 0.16 |
| 2 | Saturday | 1.41 |
| 3 | Sunday | 0.34 |
| 3 | Monday | 0.33 |
| 3 | Tuesday | 0.00 |
| 3 | Wednesday | 0.00 |
| 3 | Thursday | 0.13 |
| 3 | Friday | 0.00 |
| 3 | Saturday | 0.00 |
| 4 | Sunday | 0.00 |
| 4 | Monday | 0.00 |
| 4 | Tuesday | 0.00 |
| 4 | Wednesday | 0.00 |
| 4 | Thursday | 0.00 |
| 4 | Friday | 0.00 |
| 4 | Saturday | NA |
| 5 | Sunday | NA |
| 5 | Monday | NA |
| 5 | Tuesday | NA |
| 5 | Wednesday | NA |
| 5 | Thursday | NA |
| 5 | Friday | NA |
| 5 | Saturday | NA |
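A minimal sketch of the call that could produce a table like the one above. The object name `slo_rainfall` and its wide layout (one column per weekday) are assumptions; the values are the first two weeks from the table.

```r
library(tidyr)
library(tibble)

# Hypothetical wide layout: one row per week, one column per weekday.
slo_rainfall <- tribble(
  ~Week, ~Sunday, ~Monday, ~Tuesday, ~Wednesday, ~Thursday, ~Friday, ~Saturday,
      1,    0.00,    0.12,     0.00,       1.58,      0.91,    0.00,      0.05,
      2,    0.27,    4.26,     0.43,       0.00,      0.00,    0.16,      1.41
)

# Every weekday column is pivoted; the old column names go to Day_of_Week
# and the old cell values go to Daily_Rainfall.
rain_long <- slo_rainfall |>
  pivot_longer(cols = Sunday:Saturday,
               names_to = "Day_of_Week",
               values_to = "Daily_Rainfall")
```

Because pivot_longer() works through the data one row at a time, the result is already ordered by Week here, so no extra arrange() step is needed in this small case.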
pivot_wider()
Take a long dataset and turn it into a wide dataset.
id_cols – specify the column(s) that contain the ID for unique rows in the wide dataset.
names_from – the name of the column containing the new column names.
values_from – the name of the column containing the new column values.
pivot_wider()
Let’s say we calculate the mean amount of protein for cereals on each shelf and for each manuf.
| manuf | shelf | mean_protein |
|---|---|---|
| A | 2 | 4.000000 |
| G | 1 | 3.000000 |
| G | 2 | 1.285714 |
| G | 3 | 2.666667 |
| K | 1 | 2.750000 |
| K | 2 | 2.142857 |
| K | 3 | 2.916667 |
| N | 1 | 2.666667 |
| N | 2 | 2.500000 |
| N | 3 | 4.000000 |
| P | 1 | 1.500000 |
| P | 2 | 1.000000 |
| P | 3 | 3.000000 |
| Q | 1 | 5.000000 |
| Q | 2 | 2.000000 |
| Q | 3 | 2.500000 |
| R | 1 | 2.000000 |
| R | 3 | 3.000000 |
pivot_wider()
We can make this dataset more easily readable…
| manuf | 1 | 2 | 3 |
|---|---|---|---|
| G | 3.000000 | 1.285714 | 2.666667 |
| K | 2.750000 | 2.142857 | 2.916667 |
| N | 2.666667 | 2.500000 | 4.000000 |
| P | 1.500000 | 1.000000 | 3.000000 |
| Q | 5.000000 | 2.000000 | 2.500000 |
| R | 2.000000 | NA | 3.000000 |
| A | NA | 4.000000 | NA |
pivot_wider()

| manuf | Shelf_1 | Shelf_2 | Shelf_3 |
|---|---|---|---|
| G | 3.000000 | 1.285714 | 2.666667 |
| K | 2.750000 | 2.142857 | 2.916667 |
| N | 2.666667 | 2.500000 | 4.000000 |
| P | 1.500000 | 1.000000 | 3.000000 |
| Q | 5.000000 | 2.000000 | 2.500000 |
| R | 2.000000 | NA | 3.000000 |
| A | NA | 4.000000 | NA |
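One way the prefixed column names above might be produced is the `names_prefix` argument of pivot_wider(). This is a sketch, not necessarily the call used for the slides; `mean_protein_long` is a hypothetical name holding a few rows of the long table.

```r
library(tidyr)
library(tibble)

# A few rows from the long table of mean protein by manuf and shelf.
mean_protein_long <- tribble(
  ~manuf, ~shelf, ~mean_protein,
  "G",         1,      3.000000,
  "G",         2,      1.285714,
  "G",         3,      2.666667,
  "R",         1,      2.000000,
  "R",         3,      3.000000,
  "A",         2,      4.000000
)

# One row per manuf; one column per shelf, named Shelf_1, Shelf_2, Shelf_3.
mean_protein_wide <- mean_protein_long |>
  pivot_wider(id_cols = manuf,
              names_from = shelf,
              values_from = mean_protein,
              names_prefix = "Shelf_")
```

Combinations that never occur in the long data (for example manuf R on shelf 2) become NA in the wide result.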
Multiple, interconnected tables of data are called relational.
IMDb movie relational data
We can combine (join) data tables based on their relations.
Mutating joins
Add variables from a new dataframe to observations in an existing dataframe.
full_join(), left_join(), right_join(), inner_join()
Filtering joins
Filter observations based on values in new dataframe.
semi_join(), anti_join()
A key uniquely identifies an observation in a data set.
inner_join()
Keeps observations when their keys are present in both datasets.
inner_join(): IMDb Example

| director_id | genre | prob |
|---|---|---|
| 429 | Adventure | 0.750000 |
| 429 | Fantasy | 0.750000 |
| 2931 | Drama | 0.714286 |
| 2931 | Action | 0.428571 |
| 11652 | Sci-Fi | 0.500000 |
| 11652 | Action | 0.500000 |
| 14927 | Animation | 1.000000 |
| 14927 | Family | 1.000000 |
| 15092 | Comedy | 0.545455 |
| 15092 | Crime | 0.545455 |
ID: 429, 2931, 11652, 14927, 15092
ID: 429, 9247, 11652, 14927, 15092
| director_id | genre | prob | movie_id |
|---|---|---|---|
| 429 | Adventure | 0.750000 | 300229 |
| 429 | Fantasy | 0.750000 | 300229 |
| 11652 | Sci-Fi | 0.500000 | 10920 |
| 11652 | Sci-Fi | 0.500000 | 333856 |
| 11652 | Action | 0.500000 | 10920 |
| 11652 | Action | 0.500000 | 333856 |
| 14927 | Animation | 1.000000 | 192017 |
| 14927 | Family | 1.000000 | 192017 |
| 15092 | Comedy | 0.545455 | 109093 |
| 15092 | Comedy | 0.545455 | 237431 |
| 15092 | Crime | 0.545455 | 109093 |
| 15092 | Crime | 0.545455 | 237431 |
ID: 429, 2931, 9247, 11652, 14927, 15092
inner_join(): IMDb Example
What if our key does not have the same name?
| director_id | genre | prob |
|---|---|---|
| 429 | Adventure | 0.750000 |
| 429 | Fantasy | 0.750000 |
| 2931 | Drama | 0.714286 |
| 2931 | Action | 0.428571 |
| 11652 | Sci-Fi | 0.500000 |
| 11652 | Action | 0.500000 |
| 14927 | Animation | 1.000000 |
| 14927 | Family | 1.000000 |
| 15092 | Comedy | 0.545455 |
| 15092 | Crime | 0.545455 |
| id | first_name | last_name | genre | prob |
|---|---|---|---|---|
| 429 | Andrew | Adamson | Adventure | 0.750000 |
| 429 | Andrew | Adamson | Fantasy | 0.750000 |
| 11652 | James (I) | Cameron | Sci-Fi | 0.500000 |
| 11652 | James (I) | Cameron | Action | 0.500000 |
| 14927 | Ron | Clements | Animation | 1.000000 |
| 14927 | Ron | Clements | Family | 1.000000 |
| 15092 | Ethan | Coen | Comedy | 0.545455 |
| 15092 | Ethan | Coen | Crime | 0.545455 |
Join by different variables on dataX and dataY: join_by(a == b) will match dataX$a to dataY$b.
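A small sketch of joining on keys with different names. The table name `directors` is an assumption, and only a few rows from the tables above are used; `join_by()` requires dplyr 1.1.0 or later.

```r
library(dplyr)
library(tibble)

directors_genres <- tribble(
  ~director_id, ~genre,      ~prob,
  429,          "Adventure", 0.750000,
  2931,         "Drama",     0.714286,
  11652,        "Sci-Fi",    0.500000
)

# Hypothetical table name; its key column is called `id`, not `director_id`.
directors <- tribble(
  ~id,   ~first_name, ~last_name,
  429,   "Andrew",    "Adamson",
  9247,  "Zach",      "Braff",
  11652, "James (I)", "Cameron"
)

# Match directors$id to directors_genres$director_id.
# Only directors 429 and 11652 appear in both tables, so only they are kept.
directors_with_genres <- directors |>
  inner_join(directors_genres, by = join_by(id == director_id))
```

The key column in the result takes its name from the first (left) table, which is why the joined table has `id` rather than `director_id`.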
Remember: the dataset you pipe in becomes the first argument of the function you are piping into!
join function.
left_join() – keep only (and all) observations present in the left data set
right_join() – keep only (and all) observations present in the right data set
full_join() – keep all observations present in either data set
Which directors would remain for each of the following?
left_join(directors_genres, movies_directors)
right_join(directors_genres, movies_directors)
full_join(directors_genres, movies_directors)
semi_join()
Keeps observations when their keys are present in both datasets, but only keeps variables from the first dataset.
semi_join()

| director_id | genre | prob |
|---|---|---|
| 429 | Adventure | 0.750000 |
| 429 | Fantasy | 0.750000 |
| 11652 | Sci-Fi | 0.500000 |
| 11652 | Action | 0.500000 |
| 14927 | Animation | 1.000000 |
| 14927 | Family | 1.000000 |
| 15092 | Comedy | 0.545455 |
| 15092 | Crime | 0.545455 |
Movie Directors: 429, 2931, 11652, 14927, 15092
anti_join()
Removes observations when their keys are present in both datasets, and only keeps variables from the first dataset.
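To see the two filtering joins side by side, here is a minimal sketch using a few rows of `directors_genres` and `movies_directors`. The column layout of `movies_directors` (`director_id`, `movie_id`) is assumed from the joined tables shown earlier.

```r
library(dplyr)
library(tibble)

directors_genres <- tribble(
  ~director_id, ~genre,      ~prob,
  429,          "Adventure", 0.750000,
  2931,         "Drama",     0.714286,
  11652,        "Sci-Fi",    0.500000
)

# Assumed layout: one row per director-movie pair.
movies_directors <- tribble(
  ~director_id, ~movie_id,
  429,          300229,
  9247,         124110,
  11652,        10920
)

# Rows of directors_genres whose director_id also appears in movies_directors.
genres_with_movies <- directors_genres |>
  semi_join(movies_directors, by = join_by(director_id))

# Rows of directors_genres whose director_id does not appear in movies_directors.
genres_without_movies <- directors_genres |>
  anti_join(movies_directors, by = join_by(director_id))
```

Neither result gains a `movie_id` column: filtering joins only decide which rows of the first dataset stay.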
anti_join()
Today you will be tidying messy data to explore the relationship between countries of the world and military spending.
Today we will…
forcats
When we work with multiple tables of data, we say we are working with relational data.
When we work with relational data, we rely on keys.
How can we find each director’s active years?
| director_id | movie_id | first_name | last_name |
|---|---|---|---|
| 429 | 300229 | Andrew | Adamson |
| 2931 | 254943 | Darren | Aronofsky |
| 9247 | 124110 | Zach | Braff |
| 11652 | 10920 | James (I) | Cameron |
| 11652 | 333856 | James (I) | Cameron |
| 14927 | 192017 | Ron | Clements |
| 15092 | 109093 | Ethan | Coen |
| 15092 | 237431 | Ethan | Coen |
| 15093 | 109093 | Joel | Coen |
| 15093 | 237431 | Joel | Coen |
| 15901 | 130128 | Francis Ford | Coppola |
| 15906 | 194874 | Sofia | Coppola |
| 16816 | 350424 | Cameron | Crowe |
| 17810 | 297838 | Frank | Darabont |
| 22104 | 224842 | Clint | Eastwood |
| 24758 | 112290 | David | Fincher |
| 28395 | 46169 | Mel (I) | Gibson |
| 35573 | 18979 | Ron | Howard |
| 35838 | 257264 | John (I) | Hughes |
| 37872 | 300229 | Vicky | Jenson |
| 38746 | 238695 | Mike (I) | Judge |
| 41975 | 314965 | David | Koepp |
| 44291 | 17173 | John (I) | Landis |
| 46315 | 344203 | Jay | Levey |
| 48115 | 313459 | George | Lucas |
| 56332 | 192017 | John | Musker |
| 58201 | 30959 | Christopher | Nolan |
| 58201 | 210511 | Christopher | Nolan |
| 65940 | 111813 | Rob | Reiner |
| 66849 | 306032 | Guy | Ritchie |
| 68161 | 116907 | Herbert (I) | Ross |
| 74758 | 238072 | Steven | Soderbergh |
| 76524 | 167324 | Oliver (I) | Stone |
| 78273 | 176711 | Quentin | Tarantino |
| 78273 | 176712 | Quentin | Tarantino |
| 78273 | 267038 | Quentin | Tarantino |
| 78273 | 276217 | Quentin | Tarantino |
| 82525 | 147603 | Paul (I) | Verhoeven |
| 83616 | 207992 | Andy | Wachowski |
| 83617 | 207992 | Larry | Wachowski |
| 88802 | 256630 | Unknown | Director |
| director_id | movie_id | first_name | last_name | movie_name | year | rank |
|---|---|---|---|---|---|---|
| 429 | 300229 | Andrew | Adamson | Shrek | 2001 | 8.1 |
| 2931 | 254943 | Darren | Aronofsky | Pi | 1998 | 7.5 |
| 9247 | 124110 | Zach | Braff | Garden State | 2004 | 8.3 |
| 11652 | 10920 | James (I) | Cameron | Aliens | 1986 | 8.2 |
| 11652 | 333856 | James (I) | Cameron | Titanic | 1997 | 6.9 |
| 14927 | 192017 | Ron | Clements | Little Mermaid, The | 1989 | 7.3 |
| 15092 | 109093 | Ethan | Coen | Fargo | 1996 | 8.2 |
| 15092 | 237431 | Ethan | Coen | O Brother, Where Art Thou? | 2000 | 7.8 |
| 15093 | 109093 | Joel | Coen | Fargo | 1996 | 8.2 |
| 15093 | 237431 | Joel | Coen | O Brother, Where Art Thou? | 2000 | 7.8 |
| 15901 | 130128 | Francis Ford | Coppola | Godfather, The | 1972 | 9.0 |
| 15906 | 194874 | Sofia | Coppola | Lost in Translation | 2003 | 8.0 |
| 16816 | 350424 | Cameron | Crowe | Vanilla Sky | 2001 | 6.9 |
| 17810 | 297838 | Frank | Darabont | Shawshank Redemption, The | 1994 | 9.0 |
| 22104 | 224842 | Clint | Eastwood | Mystic River | 2003 | 8.1 |
| 24758 | 112290 | David | Fincher | Fight Club | 1999 | 8.5 |
| 28395 | 46169 | Mel (I) | Gibson | Braveheart | 1995 | 8.3 |
| 35573 | 18979 | Ron | Howard | Apollo 13 | 1995 | 7.5 |
| 35838 | 257264 | John (I) | Hughes | Planes, Trains & Automobiles | 1987 | 7.2 |
| 37872 | 300229 | Vicky | Jenson | Shrek | 2001 | 8.1 |
| 38746 | 238695 | Mike (I) | Judge | Office Space | 1999 | 7.6 |
| 41975 | 314965 | David | Koepp | Stir of Echoes | 1999 | 7.0 |
| 44291 | 17173 | John (I) | Landis | Animal House | 1978 | 7.5 |
| 46315 | 344203 | Jay | Levey | UHF | 1989 | 6.6 |
| 48115 | 313459 | George | Lucas | Star Wars | 1977 | 8.8 |
| 56332 | 192017 | John | Musker | Little Mermaid, The | 1989 | 7.3 |
| 58201 | 30959 | Christopher | Nolan | Batman Begins | 2005 | NA |
| 58201 | 210511 | Christopher | Nolan | Memento | 2000 | 8.7 |
| 65940 | 111813 | Rob | Reiner | Few Good Men, A | 1992 | 7.5 |
| 66849 | 306032 | Guy | Ritchie | Snatch. | 2000 | 7.9 |
| 68161 | 116907 | Herbert (I) | Ross | Footloose | 1984 | 5.8 |
| 74758 | 238072 | Steven | Soderbergh | Ocean's Eleven | 2001 | 7.5 |
| 76524 | 167324 | Oliver (I) | Stone | JFK | 1991 | 7.8 |
| 78273 | 176711 | Quentin | Tarantino | Kill Bill: Vol. 1 | 2003 | 8.4 |
| 78273 | 176712 | Quentin | Tarantino | Kill Bill: Vol. 2 | 2004 | 8.2 |
| 78273 | 267038 | Quentin | Tarantino | Pulp Fiction | 1994 | 8.7 |
| 78273 | 276217 | Quentin | Tarantino | Reservoir Dogs | 1992 | 8.3 |
| 82525 | 147603 | Paul (I) | Verhoeven | Hollow Man | 2000 | 5.3 |
| 83616 | 207992 | Andy | Wachowski | Matrix, The | 1999 | 8.5 |
| 83617 | 207992 | Larry | Wachowski | Matrix, The | 1999 | 8.5 |
| 88802 | 256630 | Unknown | Director | Pirates of the Caribbean | 2003 | NA |
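One possible route to each director's active years is to join the director-movie table to the movie details and then summarise the range of `year`. The names `directors_movies` and `movies`, and the layout of `movies`, are assumptions for illustration; the rows are taken from the tables above.

```r
library(dplyr)
library(tibble)

# Hypothetical name for the first table above (a few rows only).
directors_movies <- tribble(
  ~director_id, ~movie_id, ~first_name, ~last_name,
  11652,        10920,     "James (I)", "Cameron",
  11652,        333856,    "James (I)", "Cameron",
  78273,        176711,    "Quentin",   "Tarantino",
  78273,        176712,    "Quentin",   "Tarantino",
  78273,        267038,    "Quentin",   "Tarantino",
  78273,        276217,    "Quentin",   "Tarantino"
)

# Assumed layout of the movie details table.
movies <- tribble(
  ~movie_id, ~movie_name,         ~year,
  10920,     "Aliens",            1986,
  333856,    "Titanic",           1997,
  176711,    "Kill Bill: Vol. 1", 2003,
  176712,    "Kill Bill: Vol. 2", 2004,
  267038,    "Pulp Fiction",      1994,
  276217,    "Reservoir Dogs",    1992
)

# Add the year of each movie, then take the earliest and latest year per director.
active_years <- directors_movies |>
  inner_join(movies, by = join_by(movie_id)) |>
  group_by(director_id, first_name, last_name) |>
  summarise(first_year = min(year),
            last_year  = max(year),
            .groups = "drop")
```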
Consider the rodent data from Lab 2.
We want to add species_id to the rodent measurements.

| genus | species | taxa | species_id |
|---|---|---|---|
| Dipodomys | merriami | Rodent | DM |
| Dipodomys | ordii | Rodent | DO |
| Perognathus | flavus | Rodent | PF |
| Chaetodipus | penicillatus | Rodent | PP |
| Peromyscus | eremicus | Rodent | PE |
| Onychomys | leucogaster | Rodent | OL |
| Reithrodontomys | megalotis | Rodent | RM |
| Dipodomys | spectabilis | Rodent | DS |
| Onychomys | torridus | Rodent | OT |
| Neotoma | albigula | Rodent | NL |
| Peromyscus | maniculatus | Rodent | PM |
| Sigmodon | hispidus | Rodent | SH |
| Reithrodontomys | fulvescens | Rodent | RF |
| Chaetodipus | baileyi | Rodent | PB |
| genus_name | species | sex | hindfoot_length | weight |
|---|---|---|---|---|
| Dipodomys | merriami | M | 35 | 40 |
| Dipodomys | merriami | M | 37 | 48 |
| Dipodomys | merriami | F | 34 | 29 |
| Dipodomys | merriami | F | 35 | 46 |
| Dipodomys | merriami | M | 35 | 36 |
| Dipodomys | ordii | F | 32 | 52 |
| Perognathus | flavus | M | 15 | 8 |
| Dipodomys | merriami | F | 36 | 35 |
| Perognathus | flavus | M | 12 | 7 |
| Dipodomys | merriami | F | 32 | 22 |
| Perognathus | flavus | M | 16 | 9 |
| Dipodomys | merriami | F | 34 | 42 |
| Perognathus | flavus | F | 14 | 8 |
| Dipodomys | merriami | F | 35 | 41 |
| Dipodomys | merriami | F | 37 | 37 |
| Dipodomys | merriami | F | 35 | 43 |
| Dipodomys | merriami | F | 35 | 41 |
| Dipodomys | merriami | F | 33 | 40 |
| Perognathus | flavus | F | 11 | 9 |
| Dipodomys | merriami | F | 35 | 45 |
| Chaetodipus | penicillatus | F | 20 | 15 |
| Dipodomys | merriami | M | 35 | 29 |
| Dipodomys | merriami | M | 35 | 39 |
| Dipodomys | merriami | F | 36 | 43 |
| Dipodomys | merriami | M | 38 | 46 |
| Dipodomys | merriami | M | 36 | 41 |
| Dipodomys | merriami | M | 36 | 41 |
| Dipodomys | merriami | M | 38 | 40 |
| Dipodomys | merriami | M | 37 | 45 |
| Dipodomys | merriami | F | 35 | 46 |
| Dipodomys | merriami | F | 35 | 40 |
| Dipodomys | merriami | F | 35 | 30 |
| Dipodomys | merriami | M | 35 | 39 |
| Dipodomys | merriami | M | 35 | 34 |
| Dipodomys | merriami | F | 37 | 42 |
| Dipodomys | merriami | M | 37 | 42 |
| Perognathus | flavus | F | 13 | 8 |
| Dipodomys | merriami | F | 37 | 31 |
| Dipodomys | merriami | F | 36 | 40 |
| Dipodomys | merriami | M | 36 | 37 |
| Dipodomys | merriami | M | 36 | 48 |
| Dipodomys | merriami | M | 37 | 42 |
| Dipodomys | merriami | F | 39 | 45 |
| Chaetodipus | penicillatus | F | 21 | 16 |
| Dipodomys | merriami | F | 36 | 36 |
| Dipodomys | merriami | M | 36 | 42 |
| Dipodomys | merriami | M | 36 | 44 |
| Dipodomys | merriami | F | 36 | 41 |
| Dipodomys | merriami | F | 36 | 40 |
| Dipodomys | merriami | M | 37 | 34 |
| Dipodomys | merriami | M | 33 | 40 |
| Dipodomys | merriami | M | 33 | 44 |
| Dipodomys | merriami | M | 37 | 44 |
| Dipodomys | merriami | M | 34 | 36 |
| Dipodomys | merriami | M | 35 | 33 |
| Dipodomys | merriami | F | 37 | 46 |
| Dipodomys | merriami | F | 34 | 35 |
| Dipodomys | merriami | M | 36 | 46 |
| Dipodomys | merriami | F | 33 | 37 |
| Dipodomys | merriami | M | 36 | 34 |
| Dipodomys | merriami | F | 36 | 45 |
| Perognathus | flavus | F | 15 | 7 |
| Dipodomys | merriami | M | 37 | 51 |
| Dipodomys | merriami | M | 35 | 39 |
| Dipodomys | merriami | M | 36 | 29 |
| Dipodomys | merriami | F | 32 | 48 |
| Dipodomys | merriami | M | 38 | 46 |
| Dipodomys | merriami | F | 37 | 41 |
| Dipodomys | merriami | M | 37 | 45 |
| Dipodomys | merriami | F | 35 | 42 |
| Dipodomys | merriami | F | 36 | 53 |
| Dipodomys | merriami | F | 35 | 49 |
| Dipodomys | merriami | F | 36 | 46 |
| Perognathus | flavus | F | 13 | 9 |
| Chaetodipus | penicillatus | F | 19 | 15 |
| Perognathus | flavus | M | 13 | 4 |
| Dipodomys | merriami | M | 36 | 48 |
| Dipodomys | merriami | M | 37 | 51 |
| Dipodomys | merriami | M | 38 | 50 |
| Dipodomys | merriami | M | 35 | 44 |
| Dipodomys | merriami | M | 25 | 44 |
| Dipodomys | merriami | M | 35 | 45 |
| Dipodomys | merriami | F | 37 | 45 |
| Peromyscus | eremicus | M | 20 | 19 |
| Dipodomys | merriami | F | 38 | 44 |
| Dipodomys | merriami | F | 36 | 42 |
| Dipodomys | merriami | M | 37 | 39 |
| Dipodomys | merriami | M | 37 | 47 |
| Dipodomys | merriami | M | 36 | 42 |
| Dipodomys | merriami | M | 36 | 49 |
| Dipodomys | merriami | M | 38 | 39 |
| Dipodomys | merriami | F | 36 | 43 |
| Dipodomys | merriami | M | 35 | 50 |
| Dipodomys | merriami | M | 36 | 41 |
| Dipodomys | merriami | M | 37 | 47 |
| Dipodomys | merriami | F | 36 | 37 |
| Dipodomys | merriami | M | 36 | 41 |
| Dipodomys | merriami | F | 36 | 36 |
| Dipodomys | merriami | M | 36 | 45 |
| Peromyscus | eremicus | M | 19 | 20 |
| genus | species | taxa | species_id | sex | hindfoot_length | weight |
|---|---|---|---|---|---|---|
| Dipodomys | merriami | Rodent | DM | M | 35 | 40 |
| Dipodomys | merriami | Rodent | DM | M | 37 | 48 |
| Dipodomys | merriami | Rodent | DM | F | 34 | 29 |
| Dipodomys | merriami | Rodent | DM | F | 35 | 46 |
| Dipodomys | merriami | Rodent | DM | M | 35 | 36 |
| Dipodomys | merriami | Rodent | DM | F | 36 | 35 |
| Dipodomys | merriami | Rodent | DM | F | 32 | 22 |
| Dipodomys | merriami | Rodent | DM | F | 34 | 42 |
| Dipodomys | merriami | Rodent | DM | F | 35 | 41 |
| Dipodomys | merriami | Rodent | DM | F | 37 | 37 |
| Dipodomys | merriami | Rodent | DM | F | 35 | 43 |
| Dipodomys | merriami | Rodent | DM | F | 35 | 41 |
| Dipodomys | merriami | Rodent | DM | F | 33 | 40 |
| Dipodomys | merriami | Rodent | DM | F | 35 | 45 |
| Dipodomys | merriami | Rodent | DM | M | 35 | 29 |
| Dipodomys | merriami | Rodent | DM | M | 35 | 39 |
| Dipodomys | merriami | Rodent | DM | F | 36 | 43 |
| Dipodomys | merriami | Rodent | DM | M | 38 | 46 |
| Dipodomys | merriami | Rodent | DM | M | 36 | 41 |
| Dipodomys | merriami | Rodent | DM | M | 36 | 41 |
| Dipodomys | merriami | Rodent | DM | M | 38 | 40 |
| Dipodomys | merriami | Rodent | DM | M | 37 | 45 |
| Dipodomys | merriami | Rodent | DM | F | 35 | 46 |
| Dipodomys | merriami | Rodent | DM | F | 35 | 40 |
| Dipodomys | merriami | Rodent | DM | F | 35 | 30 |
| Dipodomys | merriami | Rodent | DM | M | 35 | 39 |
| Dipodomys | merriami | Rodent | DM | M | 35 | 34 |
| Dipodomys | merriami | Rodent | DM | F | 37 | 42 |
| Dipodomys | merriami | Rodent | DM | M | 37 | 42 |
| Dipodomys | merriami | Rodent | DM | F | 37 | 31 |
| Dipodomys | merriami | Rodent | DM | F | 36 | 40 |
| Dipodomys | merriami | Rodent | DM | M | 36 | 37 |
| Dipodomys | merriami | Rodent | DM | M | 36 | 48 |
| Dipodomys | merriami | Rodent | DM | M | 37 | 42 |
| Dipodomys | merriami | Rodent | DM | F | 39 | 45 |
| Dipodomys | merriami | Rodent | DM | F | 36 | 36 |
| Dipodomys | merriami | Rodent | DM | M | 36 | 42 |
| Dipodomys | merriami | Rodent | DM | M | 36 | 44 |
| Dipodomys | merriami | Rodent | DM | F | 36 | 41 |
| Dipodomys | merriami | Rodent | DM | F | 36 | 40 |
| Dipodomys | merriami | Rodent | DM | M | 37 | 34 |
| Dipodomys | merriami | Rodent | DM | M | 33 | 40 |
| Dipodomys | merriami | Rodent | DM | M | 33 | 44 |
| Dipodomys | merriami | Rodent | DM | M | 37 | 44 |
| Dipodomys | merriami | Rodent | DM | M | 34 | 36 |
| Dipodomys | merriami | Rodent | DM | M | 35 | 33 |
| Dipodomys | merriami | Rodent | DM | F | 37 | 46 |
| Dipodomys | merriami | Rodent | DM | F | 34 | 35 |
| Dipodomys | merriami | Rodent | DM | M | 36 | 46 |
| Dipodomys | merriami | Rodent | DM | F | 33 | 37 |
| Dipodomys | merriami | Rodent | DM | M | 36 | 34 |
| Dipodomys | merriami | Rodent | DM | F | 36 | 45 |
| Dipodomys | merriami | Rodent | DM | M | 37 | 51 |
| Dipodomys | merriami | Rodent | DM | M | 35 | 39 |
| Dipodomys | merriami | Rodent | DM | M | 36 | 29 |
| Dipodomys | merriami | Rodent | DM | F | 32 | 48 |
| Dipodomys | merriami | Rodent | DM | M | 38 | 46 |
| Dipodomys | merriami | Rodent | DM | F | 37 | 41 |
| Dipodomys | merriami | Rodent | DM | M | 37 | 45 |
| Dipodomys | merriami | Rodent | DM | F | 35 | 42 |
| Dipodomys | merriami | Rodent | DM | F | 36 | 53 |
| Dipodomys | merriami | Rodent | DM | F | 35 | 49 |
| Dipodomys | merriami | Rodent | DM | F | 36 | 46 |
| Dipodomys | merriami | Rodent | DM | M | 36 | 48 |
| Dipodomys | merriami | Rodent | DM | M | 37 | 51 |
| Dipodomys | merriami | Rodent | DM | M | 38 | 50 |
| Dipodomys | merriami | Rodent | DM | M | 35 | 44 |
| Dipodomys | merriami | Rodent | DM | M | 25 | 44 |
| Dipodomys | merriami | Rodent | DM | M | 35 | 45 |
| Dipodomys | merriami | Rodent | DM | F | 37 | 45 |
| Dipodomys | merriami | Rodent | DM | F | 38 | 44 |
| Dipodomys | merriami | Rodent | DM | F | 36 | 42 |
| Dipodomys | merriami | Rodent | DM | M | 37 | 39 |
| Dipodomys | merriami | Rodent | DM | M | 37 | 47 |
| Dipodomys | merriami | Rodent | DM | M | 36 | 42 |
| Dipodomys | merriami | Rodent | DM | M | 36 | 49 |
| Dipodomys | merriami | Rodent | DM | M | 38 | 39 |
| Dipodomys | merriami | Rodent | DM | F | 36 | 43 |
| Dipodomys | merriami | Rodent | DM | M | 35 | 50 |
| Dipodomys | merriami | Rodent | DM | M | 36 | 41 |
| Dipodomys | merriami | Rodent | DM | M | 37 | 47 |
| Dipodomys | merriami | Rodent | DM | F | 36 | 37 |
| Dipodomys | merriami | Rodent | DM | M | 36 | 41 |
| Dipodomys | merriami | Rodent | DM | F | 36 | 36 |
| Dipodomys | merriami | Rodent | DM | M | 36 | 45 |
| Dipodomys | ordii | Rodent | DO | F | 32 | 52 |
| Perognathus | flavus | Rodent | PF | M | 15 | 8 |
| Perognathus | flavus | Rodent | PF | M | 12 | 7 |
| Perognathus | flavus | Rodent | PF | M | 16 | 9 |
| Perognathus | flavus | Rodent | PF | F | 14 | 8 |
| Perognathus | flavus | Rodent | PF | F | 11 | 9 |
| Perognathus | flavus | Rodent | PF | F | 13 | 8 |
| Perognathus | flavus | Rodent | PF | F | 15 | 7 |
| Perognathus | flavus | Rodent | PF | F | 13 | 9 |
| Perognathus | flavus | Rodent | PF | M | 13 | 4 |
| Chaetodipus | penicillatus | Rodent | PP | F | 20 | 15 |
| Chaetodipus | penicillatus | Rodent | PP | F | 21 | 16 |
| Chaetodipus | penicillatus | Rodent | PP | F | 19 | 15 |
| Peromyscus | eremicus | Rodent | PE | M | 20 | 19 |
| Peromyscus | eremicus | Rodent | PE | M | 19 | 20 |
| Onychomys | leucogaster | Rodent | OL | NA | NA | NA |
| Reithrodontomys | megalotis | Rodent | RM | NA | NA | NA |
| Dipodomys | spectabilis | Rodent | DS | NA | NA | NA |
| Onychomys | torridus | Rodent | OT | NA | NA | NA |
| Neotoma | albigula | Rodent | NL | NA | NA | NA |
| Peromyscus | maniculatus | Rodent | PM | NA | NA | NA |
| Sigmodon | hispidus | Rodent | SH | NA | NA | NA |
| Reithrodontomys | fulvescens | Rodent | RF | NA | NA | NA |
| Chaetodipus | baileyi | Rodent | PB | NA | NA | NA |
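The joined table above needs two keys, one of which is named differently in the two tables (`genus` vs `genus_name`). A sketch of how such a join might be written, with the hypothetical names `species` and `measurements` and only a few rows from each table:

```r
library(dplyr)
library(tibble)

# Hypothetical name for the species lookup table.
species <- tribble(
  ~genus,      ~species,      ~taxa,    ~species_id,
  "Dipodomys", "merriami",    "Rodent", "DM",
  "Dipodomys", "ordii",       "Rodent", "DO",
  "Onychomys", "leucogaster", "Rodent", "OL"
)

# Hypothetical name for the measurement table.
measurements <- tribble(
  ~genus_name, ~species,   ~sex, ~hindfoot_length, ~weight,
  "Dipodomys", "merriami", "M",  35,               40,
  "Dipodomys", "ordii",    "F",  32,               52
)

# Match on both genus and species; left_join() keeps species with no
# measurements, which show up with NA in the measurement columns.
rodents_joined <- species |>
  left_join(measurements,
            by = join_by(genus == genus_name, species))
```

Swapping in inner_join() would drop the species that have no measurements instead of filling them with NA.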
Factors are used for
day_born = Sunday, Monday, Tuesday, …, Saturday
Let’s consider songs that Taylor Swift played on her Eras Tour.
I have randomly selected 25 songs (and their albums) to consider.
R
A character vector:
[1] "Red" "Reputation" "Lover" "Midnights" "1989"
[6] "Fearless" "Reputation" "Folklore" "Midnights" "Evermore"
[11] "Evermore" "Lover" "Lover" "Red" "Reputation"
[16] "Reputation" "Speak Now" "Red" "Midnights" "Fearless"
[21] "1989" "Midnights" "Fearless" "Folklore" "Lover"
A factor vector:
[1] Red Reputation Lover Midnights 1989 Fearless
[7] Reputation Folklore Midnights Evermore Evermore Lover
[13] Lover Red Reputation Reputation Speak Now Red
[19] Midnights Fearless 1989 Midnights Fearless Folklore
[25] Lover
9 Levels: 1989 Evermore Fearless Folklore Lover Midnights Red ... Speak Now
R
When you create a factor variable from a vector…
R
You can specify the order of the levels with the levels argument.
eras_data |>
  pull(Album) |>
  factor(levels = c("Fearless","Speak Now","Red","1989",
                    "Reputation","Lover","Folklore",
                    "Evermore","Midnights"))
[1] Red Reputation Lover Midnights 1989 Fearless
[7] Reputation Folklore Midnights Evermore Evermore Lover
[13] Lover Red Reputation Reputation Speak Now Red
[19] Midnights Fearless 1989 Midnights Fearless Folklore
[25] Lover
9 Levels: Fearless Speak Now Red 1989 Reputation Lover Folklore ... Midnights
forcats
We use this package to…
turn character variables into factors.
make factors by discretizing numeric variables.
rename or reorder the levels of an existing factor.
Note
The forcats package (“for categoricals”) helps wrangle categorical variables.
forcats loads with tidyverse!
fct
With fct(), the levels are automatically ordered in the order of first appearance.
[1] Red Reputation Lover Midnights 1989 Fearless
[7] Reputation Folklore Midnights Evermore Evermore Lover
[13] Lover Red Reputation Reputation Speak Now Red
[19] Midnights Fearless 1989 Midnights Fearless Folklore
[25] Lover
9 Levels: Red Reputation Lover Midnights 1989 Fearless Folklore ... Speak Now
To change a column type to factor, you must wrap fct() in a mutate() call.
I am just using pull() to display the outcome.
[1] Red Reputation Lover Midnights 1989 Fearless
[7] Reputation Folklore Midnights Evermore Evermore Lover
[13] Lover Red Reputation Reputation Speak Now Red
[19] Midnights Fearless 1989 Midnights Fearless Folklore
[25] Lover
9 Levels: Red Reputation Lover Midnights 1989 Fearless Folklore ... Speak Now
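A short sketch of wrapping fct() in mutate() so the column itself becomes a factor. The tibble below is a hypothetical three-row stand-in for `eras_data`, and fct() requires forcats 1.0.0 or later.

```r
library(dplyr)
library(forcats)
library(tibble)

# Hypothetical stand-in for eras_data.
eras_small <- tribble(
  ~Song,                ~Album,
  "Lavender Haze",      "Midnights",
  "You Belong With Me", "Fearless",
  "willow",             "Evermore"
)

# Album is stored as a factor from here on, with levels in order of first appearance.
eras_small_fct <- eras_small |>
  mutate(Album = fct(Album))

album_levels <- eras_small_fct |>
  pull(Album) |>
  levels()
```

Base factor() on the same column would instead sort the levels alphabetically (Evermore, Fearless, Midnights).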
fct
You can still specify the order of the levels with levels.
eras_data |>
  pull(Album) |>
  fct(levels = c("Fearless","Speak Now","Red","1989",
                 "Reputation","Lover","Folklore",
                 "Evermore","Midnights"))
[1] Red Reputation Lover Midnights 1989 Fearless
[7] Reputation Folklore Midnights Evermore Evermore Lover
[13] Lover Red Reputation Reputation Speak Now Red
[19] Midnights Fearless 1989 Midnights Fearless Folklore
[25] Lover
9 Levels: Fearless Speak Now Red 1989 Reputation Lover Folklore ... Midnights
fct
You can also specify non-present levels.
eras_data |>
  pull(Album) |>
  fct(levels = c("Taylor Swift",
                 "Fearless","Speak Now","Red","1989",
                 "Reputation","Lover","Folklore",
                 "Evermore","Midnights",
                 "The Tortured Poets Department"))
[1] Red Reputation Lover Midnights 1989 Fearless
[7] Reputation Folklore Midnights Evermore Evermore Lover
[13] Lover Red Reputation Reputation Speak Now Red
[19] Midnights Fearless 1989 Midnights Fearless Folklore
[25] Lover
11 Levels: Taylor Swift Fearless Speak Now Red 1989 Reputation ... The Tortured Poets Department
fct_recode
Oops, we have a typo in some of our levels! We change existing levels with the syntax <new level> = <old level>.
eras_data |>
  pull(Album) |>
  fct_recode("folklore" = "Folklore",
             "evermore" = "Evermore",
             "reputation" = "Reputation")
[1] Red reputation Lover Midnights 1989 Fearless
[7] reputation folklore Midnights evermore evermore Lover
[13] Lover Red reputation reputation Speak Now Red
[19] Midnights Fearless 1989 Midnights Fearless folklore
[25] Lover
11 Levels: Taylor Swift Fearless Speak Now Red 1989 reputation ... The Tortured Poets Department
Non-specified levels are not re-coded.
case_when
We have similar functionality with the case_when() function…
eras_data |>
  mutate(Album = case_when(Album == "Folklore" ~ "folklore",
                           Album == "Evermore" ~ "evermore",
                           Album == "Reputation" ~ "reputation",
                           .default = Album),
         Album = fct(Album)) |>
  pull(Album)
[1] Red reputation Lover Midnights 1989 Fearless
[7] reputation folklore Midnights evermore evermore Lover
[13] Lover Red reputation reputation Speak Now Red
[19] Midnights Fearless 1989 Midnights Fearless folklore
[25] Lover
9 Levels: Red reputation Lover Midnights 1989 Fearless folklore ... Speak Now
fct_collapse
Collapse multiple existing levels of a factor with the syntax <new level> = c(<old levels>).
eras_data |>
  mutate(Genre = fct_collapse(.f = Album,
                              "country pop" = c("Taylor Swift", "Fearless"),
                              "pop rock" = c("Speak Now","Red"),
                              "electropop" = c("1989","reputation","Lover"),
                              "folk pop" = c("folklore","evermore"),
                              "alt-pop" = "Midnights")) |>
  slice_sample(n = 6)
# A tibble: 6 × 3
Song Album Genre
<chr> <fct> <fct>
1 willow evermore folk pop
2 You Belong With Me Fearless country pop
3 Lavender Haze Midnights alt-pop
4 We Are Never Ever Getting Back Together Red pop rock
5 illicit affairs folklore folk pop
6 Look What You Made Me Do reputation electropop
fct_relevel
Change the order of the levels of an existing factor.
eras_data |>
  pull(Album) |>
  fct_relevel(c("Fearless","1989","Taylor Swift",
                "Speak Now","Red","Midnights","reputation",
                "folklore","Lover","evermore")) |>
  levels()
[1] "Fearless" "1989"
[3] "Taylor Swift" "Speak Now"
[5] "Red" "Midnights"
[7] "reputation" "folklore"
[9] "Lover" "evermore"
[11] "The Tortured Poets Department"
Unspecified levels remain in the same order at the end.
ggplot2
The bars follow the default factor levels.
We can order factor levels to order the bar plot.
full_eras |>
  mutate(Album = fct(Album,
                     levels = c("Fearless","Speak Now","Red",
                                "1989","Reputation","Lover",
                                "Folklore","Evermore",
                                "Midnights"))) |>
  ggplot() +
  geom_bar(aes(y = Album,
               fill = Album)) +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(x = "Number of Songs",
       y = "",
       subtitle = "Album",
       title = "Songs Played on the Eras Tour")
ggplot2
The ridge plots follow the order of the factor levels.
Inside ggplot(), we can order factor levels by a summary value.
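Ordering by a summary value is typically done with fct_reorder(). The sketch below is an assumption about how that might look, since the slide's own plotting code is not shown; the song lengths are made-up placeholder numbers, not real data.

```r
library(forcats)
library(tibble)

# Made-up placeholder lengths (in seconds), for illustration only.
songs_small <- tribble(
  ~Album,      ~Length,
  "Red",       200,
  "Red",       220,
  "Lover",     180,
  "Lover",     190,
  "Midnights", 240
)

# Reorder the Album levels by the mean Length within each album (ascending).
album_by_length <- fct_reorder(.f = fct(songs_small$Album),
                               .x = songs_small$Length,
                               .fun = mean)

levels(album_by_length)
```

The same call can be placed directly inside aes(), for example `aes(y = fct_reorder(Album, Length, .fun = mean))`, so the reordering applies to the plot without changing the stored data.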
ggplot2
The legend follows the order of the factor levels.
full_eras |>
  filter(!Album %in% c("1989","Fearless")) |>
  group_by(Album, Single) |>
  summarise(avg_len = mean(Length)) |>
  ggplot(aes(x = Single,
             y = avg_len,
             color = Album)) +
  geom_point(size = 1.5) +
  geom_line() +
  theme_minimal() +
  scale_x_continuous(breaks = c(0,1),
                     labels = c("No","Yes")) +
  labs(y = "",
       subtitle = "Average Song Length",
       color = "Album")
Inside ggplot(), we can order factor levels by the \(y\) values associated with the largest \(x\) values.
full_eras |>
  filter(!Album %in% c("1989","Fearless")) |>
  group_by(Album, Single) |>
  summarise(avg_len = mean(Length)) |>
  ggplot(aes(x = Single,
             y = avg_len,
             color = fct_reorder2(.f = Album,
                                  .x = Single,
                                  .y = avg_len))) +
  geom_point(size = 1.5) +
  geom_line() +
  theme_minimal() +
  scale_x_continuous(breaks = c(0,1),
                     labels = c("No","Yes")) +
  labs(y = "",
       subtitle = "Average Song Length",
       color = "Album")
janitor Package
janitor
Data from external sources likely has variable names not ideally formatted for R.
Names may…
janitor
The janitor package converts all variable names in a dataset to snake_case.
Names will…
_.
As packages get updated, the functions and function arguments included in those packages will change.
Learn more about lifecycle stages of packages, functions, function arguments in R.
A deprecated functionality has a better alternative available and is scheduled for removal.
Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
ℹ Please use `if_any()` or `if_all()` instead.
# A tibble: 3 × 8
Country Notes `Reporting year` `1988` `1989` `1990` `1991` `1992`
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Africa <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 North Africa <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 Sub-Saharan <NA> <NA> <NA> <NA> <NA> <NA> <NA>
You should not use deprecated functions!
Instead, we use…
# A tibble: 3 × 8
Country Notes `Reporting year` `1988` `1989` `1990` `1991` `1992`
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Africa <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 North Africa <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 Sub-Saharan <NA> <NA> <NA> <NA> <NA> <NA> <NA>
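As a sketch of the non-deprecated pattern, if_all() and if_any() can be used inside filter(). The tibble `military_small` and its non-missing values are made-up placeholders, not the real military spending data.

```r
library(dplyr)
library(tibble)

# Made-up placeholder data with year columns stored as character.
military_small <- tribble(
  ~Country,    ~`1988`,       ~`1989`,
  "Africa",    NA_character_, NA_character_,
  "Country A", "1.2",         "1.3",
  "Country B", NA_character_, "0.9"
)

# Keep rows where every selected year column is missing.
all_missing <- military_small |>
  filter(if_all(.cols = `1988`:`1989`, .fns = is.na))

# Keep rows where at least one selected year column is missing.
any_missing <- military_small |>
  filter(if_any(.cols = `1988`:`1989`, .fns = is.na))
```

if_all() is the usual drop-in replacement for across() inside filter(), because filter() keeps a row only when all of its conditions are TRUE.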
A superseded functionality has a better alternative, but is not going away.
Comments from Week 3
The tidyverse package automatically loads ggplot2, dplyr, readr, etc. – do not load these twice!
Do not use mutate() + distinct() when you can use summarize()!
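To illustrate the second comment, here is a small comparison of the two approaches on a hypothetical three-row tibble (`cereal_tiny`, built from calorie values in the cereal table). Both give one row per shelf, but summarise() does it in a single step.

```r
library(dplyr)
library(tibble)

# Hypothetical three-row subset of the cereal data.
cereal_tiny <- tribble(
  ~shelf, ~calories,
  1,      110,
  3,      70,
  3,      120
)

# Roundabout: compute the group mean on every row, then drop the duplicates.
via_mutate <- cereal_tiny |>
  group_by(shelf) |>
  mutate(mean_calories = mean(calories)) |>
  ungroup() |>
  distinct(shelf, mean_calories)

# Direct: collapse each group to one row.
via_summarise <- cereal_tiny |>
  group_by(shelf) |>
  summarise(mean_calories = mean(calories))
```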